ARIMA can also be used for seasonal data. We write it as \(ARIMA(p,d,q) (P,D,Q)_m\). where m is the number of observations per year. The seasonal part of the model consists of terms that are similar to non seasonal component, but involves backshifts of the seasonal period.

An \(ARIMA(1,1,1)(1,1,1)_4\) model without a constant can be written as \[(1-ϕB)(1-ΦB^4)(1-B)(1-B^4) y_t = (1+ \theta_1B)(1+Θ_1B^4)e_t\]

Finding the seasonal part

The seasonal part can also be seen in ACF/PACF plot.

For \(ARIMA(0,0,1)(0,0,1)_{12}\) model:

  1. a spike at lag 12 in the ACF but no other significant spike.

  2. exponential decay in the seasonal lag in the PACF plot.

For \(ARIMA(1,0,0)(1,0,0)_{12}\) model:

  1. a spike at lag 12 in the PACF but no other significant spike.

  2. exponential decay in the seasonal lag in the ACF plot.

Example: European Quarterly Retail Trade

library(fpp2)
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
autoplot(euretail) + ylab("Retail Index") + xlab("Year") +
  ggtitle("Quarterly Retail Trade Index")

The data are clearly non-stationary with some seasonality present as well. So we will first take a seasonal difference and will see if that makes data stationary.

euretail %>% diff(lag = 4) %>% ggtsdisplay()

The data still looks non stationary. So we will take a first difference as well.

euretail %>%  diff(lag = 4) %>% diff(lag = 1) %>% ggtsdisplay()

Now we will try to find a appropriate ARIMA model based on ACF/PACF plot. A significant spike at lag 1 shows non-seasonal MA(1). A significant spike at lag 4 suggest a seasonal MA(1). Hence, initially we select \(ARIMA(0,1,1)(0,1,1)_4\) model, indicating a seasonal and first difference and non-seasonal and seasonal MA(1).

euretail %>% 
  Arima(order = c(0,1,1), seasonal = c(0,1,1)) %>% 
  residuals() %>% ggtsdisplay()

Now both the ACF and PACF plots are showing a significant spike at lag 2 that means a non seasonal terms is need to be included. Now we will try \(ARIMA(0,1,2)(0,1,1)_4\)

fit_1 = Arima(euretail, order = c(0,1,2), seasonal = c(0,1,1))
summary(fit_1)
## Series: euretail 
## ARIMA(0,1,2)(0,1,1)[4] 
## 
## Coefficients:
##          ma1     ma2     sma1
##       0.2303  0.2502  -0.6991
## s.e.  0.1484  0.1188   0.1284
## 
## sigma^2 estimated as 0.1789:  log likelihood=-32.76
## AIC=73.53   AICc=74.27   BIC=81.84
## 
## Training set error measures:
##                      ME      RMSE       MAE         MPE      MAPE
## Training set -0.0448743 0.3956301 0.2816785 -0.04296557 0.2916949
##                   MASE       ACF1
## Training set 0.2291311 0.06982161
fit_1 %>% residuals() %>% ggtsdisplay()

This time we are getting a spike at 3. Now let’s try \(ARIMA(0,1,3)(0,1,1)_4\)

fit_2 = Arima(euretail, order = c(0,1,3), seasonal = c(0,1,1))
summary(fit_2)
## Series: euretail 
## ARIMA(0,1,3)(0,1,1)[4] 
## 
## Coefficients:
##          ma1     ma2     ma3     sma1
##       0.2630  0.3694  0.4200  -0.6636
## s.e.  0.1237  0.1255  0.1294   0.1545
## 
## sigma^2 estimated as 0.156:  log likelihood=-28.63
## AIC=67.26   AICc=68.39   BIC=77.65
## 
## Training set error measures:
##                       ME      RMSE       MAE         MPE      MAPE
## Training set -0.02965298 0.3661147 0.2787802 -0.02795377 0.2885545
##                   MASE        ACF1
## Training set 0.2267735 0.006455781
fit_2 %>% residuals() %>% ggtsdisplay()

This time the ACF/PACF looks much better, residuals also looks like white noise and also AICc has also improved.

fit_2 %>% forecast(h = 12) %>% autoplot()

The forecast seems to follow the recent trend because of the double differencing.

The large and rapidly increasing prediction intervals show that the retail trade index could start increasing or decreasing at any time — while the point forecasts trend downwards, the prediction intervals allow for the data to trend upwards during the forecast period.

Now we can check if auto.arima function gives the same result or not.

auto.arima(euretail, stepwise = FALSE)
## Series: euretail 
## ARIMA(0,1,3)(0,1,1)[4] 
## 
## Coefficients:
##          ma1     ma2     ma3     sma1
##       0.2630  0.3694  0.4200  -0.6636
## s.e.  0.1237  0.1255  0.1294   0.1545
## 
## sigma^2 estimated as 0.156:  log likelihood=-28.63
## AIC=67.26   AICc=68.39   BIC=77.65

It selects the same model.

Example : Corticosteroid drug sales in Australia

lh02 = log(h02)
cbind("H02 sales (million scripts)" = h02,
      "Log H02 sales" = lh02) %>% 
  autoplot(facets = TRUE) + xlab("Year") + ylab("")

The original time series has small increase in the variance so we have taken log to stabilize it. The data is seasonal in nature and obviously non-stationary, so seasonal differencing will be used.

lh02 %>% diff(lag = 12) %>% 
  ggtsdisplay(xlab = "Year",
              main = "Seasonally differenced H02 scripts")

The residuals are not stationary. PACF plot shows spike at lag 12 and 24 but ACF do not shows any significant lag at seasonal period. This suggest seasonal AR(2) model. PACF plots shows 3 significant spike and ACF do not suggest any simple model. We will start with non-seasonal AR(3). We have \(ARIMA(3,0,0)(2,1,0)_{12}\).

AICC = c()
AICC[1] = Arima(h02, order = c(3,0,0), seasonal = c(2,1,0), lambda = 0)$aicc
AICC[2] = Arima(h02, order = c(3,0,1), seasonal = c(2,1,0), lambda = 0)$aicc
AICC[3] = Arima(h02, order = c(3,0,2), seasonal = c(2,1,0), lambda = 0)$aicc
AICC[4] = Arima(h02, order = c(3,0,1), seasonal = c(1,1,0), lambda = 0)$aicc
AICC[5] = Arima(h02, order = c(3,0,1), seasonal = c(0,1,2), lambda = 0)$aicc
AICC[6] = Arima(h02, order = c(3,0,1), seasonal = c(1,1,1), lambda = 0)$aicc
AICC[7] = Arima(h02, order = c(3,0,1), seasonal = c(0,1,1), lambda = 0)$aicc
AICC
## [1] -475.1230 -476.3086 -474.8826 -463.3951 -485.4754 -484.2511 -483.6674

Of these models, the best is the \(ARIMA(3,0,1)(0,1,2)_{12}\) model.

fit = Arima(h02, order = c(3,0,1), seasonal = c(0,1,2), lambda = 0)
fit %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,1)(0,1,2)[12]
## Q* = 23.663, df = 18, p-value = 0.1664
## 
## Model df: 6.   Total lags used: 24

We still some spike in the ACF plot and the model fails the Ljung-Box test. Now we try auto.arima function.

fit_2 = auto.arima(h02, lambda = 0, stepwise = FALSE)
summary(fit_2)
## Series: h02 
## ARIMA(2,1,0)(0,1,1)[12] 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2     sma1
##       -0.8491  -0.4207  -0.6401
## s.e.   0.0712   0.0714   0.0694
## 
## sigma^2 estimated as 0.004387:  log likelihood=245.39
## AIC=-482.78   AICc=-482.56   BIC=-469.77
## 
## Training set error measures:
##                        ME       RMSE        MAE        MPE     MAPE
## Training set -0.003580125 0.05092947 0.03715772 -0.5483892 4.731282
##                   MASE        ACF1
## Training set 0.6129786 0.007272094
checkresiduals(fit_2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)(0,1,1)[12]
## Q* = 31.132, df = 21, p-value = 0.07148
## 
## Model df: 3.   Total lags used: 24

The residuals looks normally distributed but the ACF plot shows some spike. Sometimes it is just not possible to find a model that passes all of the tests. One way to go in this situation is to test all the best models on out of time data and select the one with highest forecast accuracy.

When models are compared using AICc values, it is important that all models have the same orders of differencing. However, when comparing models using a test set, it does not matter how the forecasts were produced — the comparisons are always valid.

Forecasts from the \(ARIMA(3,0,1)(0,1,2)_{12}\) model (which has the lowest RMSE value on the test set, and the best AICc value amongst models with only seasonal differencing.

h02 %>% 
  Arima(order = c(3,0,1), seasonal = c(0,1,2), lambda = 0) %>%
  forecast() %>% 
  autoplot() + 
  ylab("H02 sales (million scripts)") + xlab("Year")

Excercise

  1. Quarterly number of international tourists to Australia
cbind("No. of tourists" = austourists,
      "No. of tourists(log)" = BoxCox(austourists, lambda = BoxCox.lambda(austourists))) %>%
  autoplot(facets = TRUE) + xlab("Years")

The data is seasonal in nature and also there is a strong increasing trend in the data.

ggAcf(austourists)

ggPacf(austourists)

ACF plot shows a curve in each season. This is happening because of seasonality in the data. The PACF plot shows significant spike at lag 4 and 8 also some significant spike till lag 5.

austourists %>% diff(lag = 4) %>% ggtsdisplay()

The data shows spike at lag 4 in both the plots and no other significant spike at seasonal lags. This suggests a Seasonal \(ARIMA(1,1,0)_4\). Also, there is spike at lag1 in both the plot, that suggest a Non-seasonal AR(1). The model we are fitting is \(ARIMA(1,0,0)(1,1,0)_4\).

We can fit this model and check residuals.

aust_arima = Arima(austourists, order = c(1,0,0), seasonal = c(1,1,0))
checkresiduals(aust_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(1,1,0)[4]
## Q* = 11.843, df = 6, p-value = 0.06556
## 
## Model df: 2.   Total lags used: 8
ggPacf(residuals(aust_arima))

The ACF and PACF plot still shows some spike at lag 1. We can check if including drift gives a better result.

aust_arima_drift = Arima(austourists, order = c(1,0,0), seasonal = c(1,1,0), include.drift = TRUE)
checkresiduals(aust_arima_drift)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(1,1,0)[4] with drift
## Q* = 4.0937, df = 5, p-value = 0.536
## 
## Model df: 3.   Total lags used: 8

This looks much better. Let’s see what model is selected from auto.arima.

auto.arima(austourists, stepwise = FALSE)
## Series: austourists 
## ARIMA(1,0,0)(1,1,0)[4] with drift 
## 
## Coefficients:
##          ar1     sar1   drift
##       0.4705  -0.5305  0.5489
## s.e.  0.1154   0.1122  0.0864
## 
## sigma^2 estimated as 5.15:  log likelihood=-142.48
## AIC=292.97   AICc=293.65   BIC=301.6

Auto.arima also selects the same model.

Fitting seasonal ARIMA 11. Total net generation of electricity (in billion kilowatt hours)

autoplot(usmelec, series = "Data") +
  autolayer(ma(usmelec, order = 12, centre = TRUE), series = "2-12 MA") +
  xlab("Year") + ggtitle("Total net generation of electricity")
## Warning: Removed 12 rows containing missing values (geom_path).

There is a increasing trend till year around 2008 and then no increasing trend is visible. The data has increasing variance so we will do some BoxCox transformation.

usmelec_lambda = BoxCox.lambda(usmelec)
autoplot(BoxCox(usmelec, lambda = usmelec_lambda))

The data is stationary so we will take one seasonal difference.

usmelec %>% BoxCox(lambda = usmelec_lambda) %>% diff(lag = 12) %>% ggtsdisplay()

The data do not looks stationary still. We can take first order difference and check again

usmelec %>% BoxCox(lambda = usmelec_lambda) %>% diff(lag = 12) %>% 
  diff(lag = 1) %>% 
  ggtsdisplay()

The data now looks pretty much stationary except there are some peaks peaks and troughs.

ACF plot shows significant spike at lag 12 and no spike at any multiple of that. PACF shows slow decline at seasonal frequency. This suggest seasonal \(ARIMA(0,1,1)_12\).

Both ACF and PACF plot shows spike at lag 1 and 2. We can select non-seasonal ARIMA(0,1,2).

usmelec_arima1 = Arima(usmelec, order = c(0,1,2), seasonal = c(0,1,1), 
                       lambda = usmelec_lambda)
summary(usmelec_arima1)
## Series: usmelec 
## ARIMA(0,1,2)(0,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##           ma1      ma2     sma1
##       -0.4317  -0.2552  -0.8536
## s.e.   0.0439   0.0440   0.0261
## 
## sigma^2 estimated as 1.284e-06:  log likelihood=2544.75
## AIC=-5081.51   AICc=-5081.42   BIC=-5064.87
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.4393834 7.255962 5.331941 -0.1768805 2.012093 0.5921258
##                     ACF1
## Training set -0.02547411
ggtsdisplay(residuals(usmelec_arima1))

auto.arima(usmelec, lambda = usmelec_lambda, stepwise = FALSE)
## Series: usmelec 
## ARIMA(1,1,1)(2,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1
##       0.3888  -0.8265  0.0403  -0.0958  -0.8471
## s.e.  0.0630   0.0375  0.0555   0.0531   0.0341
## 
## sigma^2 estimated as 1.274e-06:  log likelihood=2547.36
## AIC=-5082.72   AICc=-5082.54   BIC=-5057.76

The auto.arima gives a slightly better results.